Statistical Inference V

Chelsea Parlett-Pelleriti

Hypothesis Testing

Hypothesis Testing

So far we’ve talked about inference through the lens of Parameter Estimation.

  • Parameter Estimation: inference methods that use data to determine the value of population parameters (e.g. a regression coefficient, group mean, mean difference, proportion…)

  • Hypothesis Testing: inference methods that use data to support a particular theory/hypothesis

Hypothesis Testing

While “Bayes vs. Frequentist?” is an important question to ask when choosing statistical tools, “Parameter Estimation vs. Hypothesis Testing” is even more important.

              Parameter Estimation | Hypothesis Testing
Bayesian                           |
Frequentist                        |

Hypothesis Testing

My (Chelsea’s) Personal Claim: People often misuse Hypothesis testing in situations where their questions are better answered by Parameter Estimation.

Hypotheses

  • My mean crossword time is faster than yours (\(\mu_{me} \lt \mu_{you}\))

  • There is no effect of corgi height on corgi weight (\(\beta_1 = 0\))

  • Drug A’s reduction in cold symptoms is equivalent to Drug B’s (\(\mu_{A} = \mu_{B}\))

  • There is no difference in the mean anxiety of Joy Group A and Joy Group B (\(\mu_{A} = \mu_{B}\))

Parameter Estimation

  • The estimate of my mean crossword time is \(25.89 \pm 2\)

  • The regression coefficient of corgi height on corgi weight is between \([-0.1, 0.26]\)

  • The mean difference between the reduction in cold symptoms for groups A and B is \(1.22 (0.01,2.45)\)

  • The mean difference between the anxiety ratings for groups A and B is \(-0.2\) with a standard error of \(0.05\)
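Estimates like these are just estimate-plus-margin arithmetic. A minimal sketch in Python (the course's code is in R; this is purely illustrative), using the anxiety example's numbers:

```python
from statistics import NormalDist

# Approximate 95% interval: estimate ± z * se, using the anxiety
# example above (estimate -0.2, standard error 0.05)
estimate, se = -0.2, 0.05
z = NormalDist().inv_cdf(0.975)               # ≈ 1.96
ci = (estimate - z * se, estimate + z * se)
print(f"95% CI: ({ci[0]:.3f}, {ci[1]:.3f})")  # 95% CI: (-0.298, -0.102)
```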

Null Hypothesis Significance Testing

Reductio Ad Absurdum

Note: NHST is NOT the only way to be a Frequentist. Often, critiques of Frequentism are actually critiques of NHST.


Reductio Ad Absurdum (Reduction to Absurdity) is a form of proof by contradiction: to prove X, assume not-X, show that this assumption leads to a false, ridiculous, or highly unlikely outcome, and conclude X.

Example:

  1. Claim: there is no smallest positive rational number (one that can be represented as a fraction \(\frac{g}{n}\))

  2. Assume the Contradiction: there is a smallest positive rational number \(q\)

  3. RAA: since \(q\) is positive, \(\frac{q}{2}\) is a positive rational number (literally repping it with a fraction rn) and \(\frac{q}{2} \lt q\), so \(q\) was not the smallest after all

  4. Conclusion: there is no smallest positive rational number

Reductio Ad Absurdum

Claim: there is no town with a local barber who shaves all and only those who do not shave themselves.

Prove this with RAA.

Null Hypothesis Significance Testing

  • Null Hypothesis: Any hypothesis of “no effect”

  • Significance Testing: evaluating how consistent the observed data are with a hypothesis

Null Hypotheses

  • The regression coefficient of IQ’s effect on Income is 0 \(\beta_{iq} = 0\)

  • The mean difference between the GPA of EECS and CADS students is 0 \(d_{e-c} = 0\)

  • The proportion of heads on this coin is 0.5 \(p = 0.5\)

All of the above assume “no effect”

Test Statistics

Test-Statistic: a single-number summary of the data, calculated from a sample: \(f(x)\)

Example: the z-statistic, \(\frac{\text{observed} - \text{expected}}{se}\)
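The z-statistic formula is simple enough to sketch directly (Python here for illustration; the numbers below are hypothetical):

```python
def z_statistic(observed, expected, se):
    """z-statistic: how many standard errors the observed
    estimate falls from its expected value under the null."""
    return (observed - expected) / se

# hypothetical numbers: sample mean 104.5, null mean 100, se = 3
print(z_statistic(104.5, 100, 3))  # 1.5
```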

P-Values

First of all, I am very proud to have taught you this much about Statistical Inference without once mentioning p-values. But alas. It is time.

P-Values

P-values: \(p(\text{data} \mid H_0)\); assuming the null is true and there’s no effect, what is the probability of observing a test-statistic as or more extreme than the one we calculated from our data?

Null Sampling Distribution

Notice, this sampling distribution is not centered around \(\hat{\theta}\), but around our null value \(\theta_0\). The standard error is still calculated as \(\frac{\sigma}{\sqrt{n}}\). This distribution defines the sample estimates \(\hat{\theta}\) we’d expect if we repeatedly sampled from the null.
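You can see the null sampling distribution emerge by simulation. A stdlib-Python sketch (values chosen to mirror the IQ setting: null mean 100, sd 15, n = 25):

```python
import random
from statistics import mean, stdev

random.seed(1)
theta_0, sigma, n = 100, 15, 25   # null mean, population sd, sample size

# simulate the null sampling distribution: many sample means
# drawn assuming the null is true
means = [mean(random.gauss(theta_0, sigma) for _ in range(n))
         for _ in range(5000)]

print(round(mean(means), 1))   # centered near theta_0 = 100
print(round(stdev(means), 1))  # near sigma / sqrt(n) = 3
```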

P-Values

P-values: \(p(\text{data} \mid H_0)\); assuming the null is true and there’s no effect, what is the probability of observing a test-statistic as or more extreme than the one we calculated from our data?

Directional P-Values

  • Directional Null: \(\mu \geq 0\)

  • Non-Directional Null: \(\mu = 0\)
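Whether the null is directional or not decides whether you count one tail or both. A Python sketch with a hypothetical z-statistic of 1.5:

```python
from statistics import NormalDist

z = 1.5  # hypothetical z-statistic

# directional null: count only results this large or larger (one tail)
p_one = 1 - NormalDist().cdf(z)
# non-directional null: count results this extreme in either direction
p_two = 2 * (1 - NormalDist().cdf(abs(z)))

print(round(p_one, 3), round(p_two, 3))  # 0.067 0.134
```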

Fisherian Hypothesis Testing

💡 p-values are a continuous measure of evidence against \(H_0\)

❓ answers the question “is the observed data consistent with \(H_0\)?”

Fisherian Hypothesis Testing

  1. Choose an appropriate test

  2. Define \(H_0\)

  3. Calculate p-value

  4. Assess Significance

    • the lower the p-value, the stronger the evidence against the null

Fisherian Hypothesis Testing

We’re testing the hypothesis that SuperSmartizine™️ increases IQ. We gave SuperSmartizine™️ to 25 people and measured their IQs. The sample mean IQ is 104.5.

  1. Choose an appropriate test: z-test

  2. Define \(H_0\): \(\mu_{ss} = 100\)

  3. Calculate p-value: under the null sampling distribution \(\mathcal{N}(100, \frac{15}{\sqrt{25}})\), \(p(\bar{x} \geq 104.5 \mid \mu_{ss} = 100) \approx 0.07\)

  4. Assess Significance: this is not strong evidence against the null. We’d expect sample means as or more extreme than this about 7% of the time under repeated samples from the null
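This example can be worked in a few lines of Python (a sketch assuming \(\sigma = 15\) is known; since the hypothesis is that the drug *increases* IQ, the p-value is one-tailed):

```python
from math import sqrt
from statistics import NormalDist

mu_0, sigma, n = 100, 15, 25       # null mean, assumed-known IQ sd, sample size
x_bar = 104.5                      # observed sample mean

se = sigma / sqrt(n)               # 15 / 5 = 3.0
null_dist = NormalDist(mu_0, se)   # null sampling distribution of the mean

# one-tailed p-value: probability of a sample mean this large or larger
p = 1 - null_dist.cdf(x_bar)
print(round(p, 3))  # 0.067
```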

Fisherian Hypothesis Testing

We’re testing the hypothesis that the effect of minutes of exercise on heart attack risk \(\beta_{exercise}\) is different from 0. We collect 1000 data points and fit a logistic regression model \(\text{heart_attack} \sim \text{age} + \text{sex} + \text{exercise_minutes}\). The coefficient is \(-0.002505\)

  1. Choose an appropriate test: z-test (notice the glm output below reports a z value for each coefficient)

  2. Define \(H_0\): \(\beta_{exercise} = 0\)

  3. Calculate p-value: computer does this for us tbh; \(0.000747\)

  4. Assess Significance: this is strong evidence against the null. We’d expect coefficients as or more extreme than this about 0.0747% of the time under repeated samples from the null.

Fisherian Hypothesis Testing

lr <- glm(heart_attack ~ age + sex + exercise_minutes, 
    data = data,
    family = binomial)
summary(lr)

Call:
glm(formula = heart_attack ~ age + sex + exercise_minutes, family = binomial, 
    data = data)

Coefficients:
                   Estimate Std. Error z value Pr(>|z|)    
(Intercept)       0.3199939  0.2798233   1.144 0.252807    
age               0.0043040  0.0045122   0.954 0.340157    
sex               0.2871836  0.1294311   2.219 0.026499 *  
exercise_minutes -0.0025059  0.0007432  -3.372 0.000747 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1361.2  on 999  degrees of freedom
Residual deviance: 1344.9  on 996  degrees of freedom
AIC: 1352.9

Number of Fisher Scoring iterations: 4

Studies in Crop Variation: (A)Nova

Statistical Methods for Research Workers

The value for which P=0.05, or 1 in 20, is 1.96 or nearly 2; it is convenient to take this point as a limit in judging whether a deviation ought to be considered significant or not. Deviations exceeding twice the standard deviation are thus formally regarded as significant. Using this criterion we should be led to follow up a false indication only once in 22 trials, even if the statistics were the only guide available. Small effects will still escape notice if the data are insufficiently numerous to bring them out, but no lowering of the standard of significance would meet this difficulty.

Justify Your \(\alpha\)

https://osf.io/preprints/psyarxiv/ts4r6


Neyman-Pearson Significance Testing

💡 make a decision about whether you will act as if \(H_0\) is false while controlling your long run error rates

❓ answers the question “is the observed data extreme enough for us to reject \(H_0\)?”

Neyman-Pearson Significance Testing

  • Null Hypothesis: Any hypothesis of “no effect” (\(H_0\))

  • Alternative Hypothesis: The opposite of the Null, there is an effect (\(H_1\) or \(H_A\))

            Fail to Reject H0        Reject H0
H0 True     Correct                  Type I Error (FP)
H1 True     Type II Error (FN)       Correct

Neyman-Pearson Significance Testing

\(H_0: \mu > 0\)

Neyman-Pearson Significance Testing

  • Fail to Reject \(H_0\): we have not provided evidence that \(H_0\) is false, we will not act as if it’s false

  • Reject \(H_0\): we have provided evidence that \(H_0\) is false, we will act as if it’s false

Neyman-Pearson Significance Testing

These four outcomes all have defined probabilities.

            Fail to Reject H0        Reject H0
H0 True     \(1-\alpha\)             \(\alpha\)
H1 True     \(\beta\)                \(1-\beta\)

Remember: we get to choose \(\alpha\) directly
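A quick simulation shows what "choosing \(\alpha\)" buys us: when \(H_0\) is true, a test with \(\alpha = 0.05\) rejects about 5% of the time in the long run. A stdlib-Python sketch (hypothetical setup: one-sample z-test, true mean 0, known sd 1, n = 25):

```python
import random
from statistics import NormalDist

random.seed(7)
alpha = 0.05
z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # ≈ 1.96

# simulate many experiments where H0 is true (true mean = 0)
# and count how often we (wrongly) reject
n, reps, rejections = 25, 20000, 0
for _ in range(reps):
    sample = [random.gauss(0, 1) for _ in range(n)]
    z = (sum(sample) / n) / (1 / n**0.5)      # z-statistic under H0
    if abs(z) > z_crit:
        rejections += 1

print(rejections / reps)  # long-run Type I error rate, close to 0.05
```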

Neyman-Pearson Significance Testing

  1. Choose an appropriate test

  2. Define \(H_0\) and \(H_A\)

  3. Calculate test-statistic and critical value

  4. Assess Significance

    • if our test statistic is more extreme than our critical value, we will act as if the \(H_0\) is false

Neyman-Pearson Significance Testing

We are testing the hypothesis that the proportion of Chapman students who voted is different from the US proportion of \(0.66\). We polled 100 Chapman students and 75% (0.75) of them voted.

  1. Choose an appropriate test: one sample z-test for proportions

  2. Define \(H_0\) and \(H_A\):

    • \(H_0\): \(p_{chap} = 0.66\)

    • \(H_A\): \(p_{chap} \neq 0.66\)

  3. Calculate test-statistic and critical value: \(z = \frac{0.75-0.66}{se} \approx 1.9\); the critical value is \(1.96\) when \(\alpha = 0.05\)

  4. Assess Significance: we fail to reject \(H_0\), and will not act as if \(H_0\) is false.

Z-statistic: 1.899901 
P-value: 0.05744605 
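The output above can be reproduced with a short Python sketch (standard error computed under the null, two-tailed p-value):

```python
from math import sqrt
from statistics import NormalDist

p_hat, p_0, n = 0.75, 0.66, 100

se = sqrt(p_0 * (1 - p_0) / n)                # standard error under the null
z = (p_hat - p_0) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-tailed p-value

print(round(z, 6), round(p_value, 6))  # 1.899901 0.057446
```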

Neyman-Pearson Significance Testing

We are testing the hypothesis that the coefficient for the effect of age on stress levels \(\beta_{age}\) is not 0.

  1. Choose an appropriate test: t-test

  2. Define \(H_0\) and \(H_A\)

    • \(H_0: \beta_{age} = 0\)

    • \(H_A: \beta_{age} \neq 0\)

  3. Calculate test-statistic and critical value: t-statistic = \(2.465\), critical value with \(\alpha = 0.05\) and \(df = 98\) is \(1.984\)

  4. Assess Significance: We reject \(H_0\), we will act as if \(H_0\) is false, and assume \(\beta_{age} \neq 0\)
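The Neyman-Pearson decision rule itself is one comparison. A sketch using the numbers from the two examples above:

```python
def np_decision(test_stat, critical_value):
    """Neyman-Pearson decision rule: reject H0 iff the test
    statistic is more extreme than the critical value."""
    return "reject H0" if abs(test_stat) > critical_value else "fail to reject H0"

# regression example: t = 2.465, critical value 1.984
print(np_decision(2.465, 1.984))   # reject H0
# voting example: z = 1.8999, critical value 1.96
print(np_decision(1.8999, 1.96))   # fail to reject H0
```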

Neyman-Pearson Significance Testing

# Run linear regression model
model <- lm(stress_level ~ age,
            data = data)
# Summarize the model
summary(model)

Call:
lm(formula = stress_level ~ age, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-22.357  -6.099  -0.210   5.956  22.156 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  30.3127     3.1666   9.573 1.02e-15 ***
age           0.1795     0.0728   2.465   0.0154 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.692 on 98 degrees of freedom
Multiple R-squared:  0.0584,    Adjusted R-squared:  0.04879 
F-statistic: 6.078 on 1 and 98 DF,  p-value: 0.01543

Power Analysis

If there is an effect, how likely are you to detect it (\(1-\beta\))?

            Fail to Reject H0              Reject H0
H0 True     \(1-\alpha\) (Correct)         \(\alpha\) (Type I Error)
H1 True     \(\beta\) (Type II Error)      \(1-\beta\) (Power)

Power Analysis

Power Analysis

❓ What are things we could change that would increase our statistical power?

Power Analysis

  • sample size

  • population standard deviation

  • effect size

  • \(\alpha\)
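Each of these levers can be seen in a power formula for a two-tailed one-sample z-test. A stdlib-Python sketch (the baseline numbers are hypothetical: effect 0.5, sd 1, n = 25, \(\alpha = 0.05\)):

```python
from math import sqrt
from statistics import NormalDist

def power_one_sample_z(mu_diff, sigma, n, alpha=0.05):
    """Power of a two-tailed one-sample z-test: the probability of
    rejecting H0 when the true mean differs from the null by mu_diff."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    shift = mu_diff / (sigma / sqrt(n))  # true effect in standard-error units
    return (1 - norm.cdf(z_crit - shift)) + norm.cdf(-z_crit - shift)

print(round(power_one_sample_z(0.5, 1, 25), 3))  # ≈ 0.705

# each lever from the list above raises power:
print(round(power_one_sample_z(0.5, 1, 50), 3))        # bigger n
print(round(power_one_sample_z(0.8, 1, 25), 3))        # bigger effect size
print(round(power_one_sample_z(0.5, 0.8, 25), 3))      # smaller sigma
print(round(power_one_sample_z(0.5, 1, 25, 0.10), 3))  # bigger alpha
```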

Power Analysis: \(n\)

Power Analysis: \(\alpha\)

Power Analysis: Effect Size

Power Analysis: \(\sigma\)